1  Introduction

This paper presents a classifier stacking-based approach
to the named entity recognition task (NER henceforth).
Transformation-based learning (Brill, 1995), Snow (Sparse
Network of Winnows; Munoz et al., 1999) and a
forward-backward algorithm are stacked (the output of one
classifier is passed as input to the next classifier),
yielding considerable improvement in performance. In
addition, in agreement with other studies on the same
problem, the enhancement of the feature space (in the form
of capitalization information) is shown to be especially
beneficial to this task.

2  Computational  Approaches 

All approaches to the NER task presented in this paper,
except the one presented in Section 3, use the IOB chunk
tagging method (Tjong Kim Sang and Veenstra, 1999) for
identifying the named entities.
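
To make the IOB representation concrete, the following is a minimal sketch of how a tag sequence can be decoded into entity spans. It assumes the IOB2 variant (every entity starts with a B- tag); the function name `iob2_to_spans` and the half-open span convention are illustrative choices, not part of the original system.

```python
def iob2_to_spans(tags):
    """Decode IOB2 tags into (start, end, type) spans; end is exclusive."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        # A B- tag always opens an entity; a stray I- tag is treated
        # as an opening tag as well (a lenient choice for bad input).
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans
```

For example, the tags `B-ORG I-ORG O B-LOC` describe one two-word organization followed, after one outside word, by a one-word location.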

2.1  Feature  Space  and  Baselines 

A careful selection of the feature space is a very
important part of classifier design. The algorithms
presented in this paper use only information that can be
extracted directly from the training data: the words,
their capitalization information and the chunk tags. While
they can definitely incorporate additional information
(such as lists of countries/cities/regions, organizations,
people names, etc.), given the limited space, we decided
to restrict them to this feature space.

Table 2 presents the results obtained by running
off-the-shelf part-of-speech/text chunking classifiers;
all of them use just word information, albeit in different
ways. The leader of the pack is the MXPOST tagger
(Ratnaparkhi, 1996). The measure of choice for the NER
task is F-measure, the harmonic mean of precision and
recall:

$$F_\beta = \frac{(\beta^2 + 1) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}},$$

usually computed with $\beta = 1$.
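
As a small worked illustration, the F-measure can be computed from precision and recall as sketched below. The function name `f_measure` and the convention of returning 0 when both inputs are 0 are assumptions made here for convenience.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)
```

With precision 0.8 and recall 0.6, the value for beta = 1 is 0.96 / 1.4, or roughly 0.686.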

As observed by participants in the MUC-6 and MUC-7 tasks
(Bikel et al., 1997; Borthwick, 1999; Miller et al.,
1998), an important feature for the NER task is
information about word capitalization. In an approach
similar to Zhou and Su (2002), we extracted for each word
a 2-byte code, as summarized in Table 1. The first byte
specifies the capitalization of the word (first letter
capital, etc.), while the second specifies whether the
word is present in the dictionary in lower case, upper
case, both or neither form. These two codes are extracted
in order to offer both a way of backing off in sparse-data
cases (unknown words) and a way of encouraging
generalization. Table 2 shows the performance of the fnTBL
(Ngai and Florian, 2001) and Snow systems when using the
capitalization information; both systems perform
considerably better with it.

1: Capitalization information   first_cap, all_caps, all_lower,
                                number, punct, other
2: Presence in dictionary       upper, lower, both, none

Table 1: Capitalization information
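
The two codes of Table 1 could be extracted roughly as in the sketch below. The paper does not give the exact decision rules, so the rule order, the treatment of the "dictionary" as the set of word forms seen in training, and all function names are assumptions made for illustration.

```python
import string

def capitalization_code(word):
    """First code: one of the six capitalization classes of Table 1."""
    if not word:
        return "other"
    if word.replace(",", "").replace(".", "").isdigit():
        return "number"
    if all(ch in string.punctuation for ch in word):
        return "punct"
    if word.isalpha() and word.isupper():
        return "all_caps"
    if word.isalpha() and word.islower():
        return "all_lower"
    if word[0].isupper():
        return "first_cap"
    return "other"

def build_case_index(vocabulary):
    """Split the known word forms into lower-cased and non-lower-cased keys."""
    lower_forms, upper_forms = set(), set()
    for form in vocabulary:
        key = form.lower()
        (lower_forms if form == key else upper_forms).add(key)
    return lower_forms, upper_forms

def dictionary_code(word, lower_forms, upper_forms):
    """Second code: is the word known in lower case, upper case, both or neither?"""
    key = word.lower()
    in_lower, in_upper = key in lower_forms, key in upper_forms
    if in_lower and in_upper:
        return "both"
    if in_upper:
        return "upper"
    if in_lower:
        return "lower"
    return "none"
```

An unknown word such as a new surname still receives a useful capitalization class, which is what makes these codes a back-off for sparse data.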

2.2  Transformation-Based  Learning 

Transformation-based learning (TBL henceforth) is an
error-driven machine learning technique which works by
first assigning an initial classification to the data, and
then automatically proposing, evaluating and selecting the
transformations that maximally decrease the number of
errors. Each such transformation, or rule, consists of a
predicate and a target. In our implementation of TBL,
fnTBL, predicates consist of a conjunction of atomic
predicates, such as feature identity (e.g.
$word_0 = \text{Barcelona}$), membership in a set (e.g.
$\text{B-ORG} \in \{chunk_{-3}, \ldots, chunk_{-1}\}$), etc.
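
The propose/evaluate/select loop can be sketched as follows. This is a deliberately simplified illustration, not fnTBL itself: each rule tests a single feature of the current sample (no conjunctions, no context window), and the names `apply_rule`, `count_errors` and `learn_tbl` are hypothetical.

```python
def apply_rule(rule, features, tags):
    """Retag every sample whose feature matches the rule's predicate."""
    feat, value, target = rule
    return [target if f.get(feat) == value else t
            for f, t in zip(features, tags)]

def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_tbl(features, gold, initial_tag, max_rules=10):
    """Greedy TBL: repeatedly pick the rule that removes the most errors."""
    tags = [initial_tag] * len(gold)          # initial classification
    learned = []
    for _ in range(max_rules):
        # Propose candidate rules only from currently misclassified samples.
        candidates = {(k, v, g)
                      for f, t, g in zip(features, tags, gold) if t != g
                      for k, v in f.items()}
        best, best_gain = None, 0
        for rule in sorted(candidates):       # sorted for determinism
            gain = (count_errors(tags, gold)
                    - count_errors(apply_rule(rule, features, tags), gold))
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:                      # no rule reduces the error
            break
        tags = apply_rule(best, features, tags)
        learned.append(best)
    return learned, tags
```

On a toy sample where both capitalized words are locations, the single rule "cap = first_cap, so tag B-LOC" removes both errors at once and is selected over the two word-identity rules, which each fix only one.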

TBL has some attractive qualities that make it suitable
for language-related tasks: it can automatically integrate
heterogeneous types of knowledge, without the need for
explicit modeling (similar to Snow, Maximum Entropy,
decision trees, etc.); it is error-driven, and therefore
directly minimizes the ultimate evaluation measure, the
error rate; and it has an inherently dynamic behavior¹.
TBL has been previously applied to the English NER task
(Aberdeen et al., 1995), with good results.

Table 2: Comparative results for different methods on the
Spanish development data

The fnTBL-based NER system is designed in the same way as
Brill's POS tagger (Brill, 1995), consisting of a
morphological stage, where unknown words' chunks are
guessed based on their morphological and capitalization
representation, followed by a contextual stage, in which
the full interaction between the words' features is
leveraged for learning. The feature templates used are
based on a combination of word, chunk and capitalization
information of words in a 7-word window around the target
word. The entire template list (133 templates) will be
made available from the author's web page after the
conclusion of the shared task.

2.3  Snow 

Snow (Sparse Network of Winnows) is an architecture for
error-driven machine learning, consisting of a sparse
network of linear separator units over a common predefined
or incrementally learned feature space. The system assigns
weights to each feature, and iteratively updates these
weights in such a way that the misclassification error is
minimized. For more details on Snow's architecture, please
refer to Munoz et al. (1999).
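
For intuition, a single linear separator unit trained with the textbook Winnow update can be sketched as below. This is not Snow's actual implementation: the promotion factor `alpha`, the threshold (set to the number of features) and the function names are assumptions, and Snow combines many such units in a sparse network.

```python
def winnow_train(samples, n_features, alpha=2.0, epochs=10):
    """samples: list of (active_feature_indices, label), label in {0, 1}."""
    weights = [1.0] * n_features
    theta = float(n_features)                 # fixed decision threshold
    for _ in range(epochs):
        mistakes = 0
        for active, label in samples:
            predicted = 1 if sum(weights[i] for i in active) >= theta else 0
            if predicted != label:
                mistakes += 1
                # Promote active weights on a missed positive,
                # demote them on a false positive.
                factor = alpha if label == 1 else 1.0 / alpha
                for i in active:
                    weights[i] *= factor
        if mistakes == 0:                     # converged on the training set
            break
    return weights, theta

def winnow_predict(weights, theta, active):
    return 1 if sum(weights[i] for i in active) >= theta else 0
```

Because updates are multiplicative and touch only the active features, the unit stays cheap to train when the feature space is large and each sample activates few features.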

Table 2 presents the results obtained by Snow on the NER
task, when using the same methodology as Munoz et al.
(1999), both with their templates² and with the same
templates as fnTBL.

¹The quality of chunk tags evolves as the algorithm
progresses; there is no mismatch between the quality of
the surrounding chunks during training and testing.

²In this experiment, we used the feature patterns
described in Munoz et al. (1999): a combination of up to 2
words in a 3-word window around the target word and a
combination of up to 4 chunks in a 7-word window around
the target word. Throughout the paper, Snow's default
parameters were used.


Figure 1: Performance of applying Snow to TBL's output,
plotted against iteration number

2.4  Stacking  Classifiers 

Both the fnTBL and the Snow methods have strengths and
weaknesses:

•  fnTBL's strength lies in its dynamic modeling of chunk
tags: by starting in a simple state and using complex
feature interactions, it is able to reach a reasonable
end-state. Its weakness is its acute myopia: the
optimization is done greedily for the local context, and
the feature interaction is observed only in the order in
which the rules are selected.

•  Snow's strength lies in its ability to model
interactions between all the features associated with a
sample. However, in order to obtain good results, the
system needs reliable contextual information. Since the
approach is not dynamic by nature, good initial chunk
classifications are needed.

One way to address both weaknesses is to combine the two
approaches through stacking, by applying Snow to fnTBL's
output. This allows Snow to have access to reasonably
reliable contextual information, and also allows the
output of fnTBL to be corrected for multiple feature
interaction. This stacking approach has an intuitive
interpretation: first, the corpus is dynamically labeled
using the most important features through fnTBL rules
(coarse-grained optimization), and then it is tuned at a
finer grain through a few full-feature-interaction
iterations of Snow.
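
The stacking step itself amounts to exposing the first classifier's chunk tags as features for the second one. The sketch below shows one way to build such stacked features; the feature names (`word_0`, `chunk_-1`, ...), the padding symbol and the window size are hypothetical and do not reproduce the actual templates used.

```python
def add_stage_one_tags(tokens, stage_one_tags, window=1):
    """Build per-token features that include first-stage chunk tags
    at offsets -window..+window around each token."""
    stacked = []
    n = len(tokens)
    for i, word in enumerate(tokens):
        feats = {"word_0": word}
        for off in range(-window, window + 1):
            j = i + off
            # Positions outside the sentence get a padding symbol.
            feats[f"chunk_{off}"] = stage_one_tags[j] if 0 <= j < n else "<pad>"
        stacked.append(feats)
    return stacked
```

The second-stage learner then sees reasonably reliable tags for the neighbouring words, including those to the right of the target, which a single left-to-right pass would not provide.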

Table 2 contrasts stacking Snow and fnTBL with running
either fnTBL or Snow in isolation: an improvement of 1.6
F-measure points is obtained when stacking is applied.
Interestingly, as shown in Figure 1, the relation between
performance and Snow iteration number is not monotonic:
the system initially takes a hit as it moves out of the
local fnTBL maximum, but then proceeds to increase its
performance, finally converging after 10 iterations to an
F-measure value of 73.49.

            Accuracy    F(β=1)
Spanish     98.42%      90.26
Dutch       98.54%      88.03

Table 3: Unlabeled chunking results obtained by fnTBL
on the development sets

Method                        Spanish    Dutch
FB performance                76.49      73.30
FB on perfect chunk breaks    83.52      81.30

Table 4: Forward-Backward results (F-measure) on the
development sets


3  Breaking-Up  the  Task

Munoz et al. (1999) examine a different method of
chunking, called the Open/Close (O/C) method: two
classifiers are used, one predicting open brackets and one
predicting closed brackets. A final optimization stage
pairs open and closed brackets through a global search.

We propose here a method that is similar in spirit to the
O/C method, and also to Carreras and Marquez (2001) and
Arevalo et al. (2002):

1.  In the first stage, detect only the entity boundaries,
without identifying their type, using the fnTBL system³;

2.  Using a forward-backward type algorithm (FB
henceforth), determine the most probable type of each
entity detected in the first step.

This method has some enticing properties:

•  Detecting only the entity boundaries is a simpler
problem, as different entity types share common features;
Table 3 shows the performance obtained by the fnTBL
system, which is noticeably higher than the one shown in
Table 2;

•  The FB algorithm allows for a global search for the
optimum, which is beneficial since both fnTBL and Snow
perform only local optimizations;

•  The FB algorithm has access to both entity-internal and
external contextual features (as first described in
McDonald (1996)); furthermore, since the chunks are
collapsed, the local area is also larger in span.

³For this task, Snow does not bring any improvement to
fnTBL's output.

The input to the FB algorithm consists of a series of
chunks $C_1, \ldots, C_n$, each spanning a sequence of
words

$$w_1 \ldots w_{b_1 - 1}\ \underbrace{w_{b_1} \ldots w_{e_1}}_{C_1}\ \ldots\ \underbrace{w_{b_n} \ldots w_{e_n}}_{C_n}\ \ldots\ w_m$$

For each marked entity $C_j$, the goal is to determine its
most likely type:⁴


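$$\hat{E}_j = \arg\max_{E_j} P\left(E_j \mid w_1^m\right) = \arg\max_{E_j} P\left(w_1^{b_j - 1}, E_j, w_{e_j + 1}^m\right) \cdot P\left(\mathrm{len}\left(w_{b_j}^{e_j}\right) \mid E_j\right) \cdot P\left(w_{b_j}^{e_j} \mid E_j\right) \qquad (1)$$

where $P\left(w_1^{b_j - 1}, E_j, w_{e_j + 1}^m\right)$
represents the entity-external/contextual probability, and
$P\left(\mathrm{len}\left(w_{b_j}^{e_j}\right) \mid E_j\right) \cdot P\left(w_{b_j}^{e_j} \mid E_j\right)$
is the entity-internal probability. These probabilities
are computed using the standard Markov assumption of
independence, and the forward-backward algorithm⁵. Both
internal and external models use 5-gram language models,
smoothed using the modified discount method of Chen and
Goodman (1998). In the case of unseen words, backoff to
the capitalization tag is performed: if $w_k$ is unknown,
$P(w_k \mid E_j) = P(\mathrm{capit}(w_k) \mid E_j)$.
Finally, the probability
$P\left(\mathrm{len}\left(w_{b_j}^{e_j}\right) \mid E_j\right)$
is assumed to be exponentially distributed.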
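
The entity-internal part of this score can be illustrated with a heavily simplified sketch. It departs from the method described above in several labeled ways: it uses add-one-smoothed unigram models instead of 5-gram models, it omits the contextual probability and the forward-backward computation altogether, it uses an exponential density with rate 1 / (mean length) as the length score, and the coarse `capit` classes, the class `InternalModel` and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def capit(word):
    """Coarse capitalization class (stand-in for the capitalization code)."""
    if word.isupper():
        return "all_caps"
    if word[:1].isupper():
        return "first_cap"
    if word.islower():
        return "all_lower"
    return "other"

class InternalModel:
    N_CAP_CLASSES = 4  # number of classes capit() can return

    def __init__(self, entities):
        """entities: list of (list_of_words, entity_type) pairs."""
        self.word_counts = defaultdict(Counter)
        self.cap_counts = defaultdict(Counter)
        self.total_len = Counter()
        self.n_entities = Counter()
        self.vocab = set()
        for words, etype in entities:
            self.n_entities[etype] += 1
            self.total_len[etype] += len(words)
            for w in words:
                self.word_counts[etype][w] += 1
                self.cap_counts[etype][capit(w)] += 1
                self.vocab.add(w)

    def log_p_word(self, w, etype):
        total = self.total_len[etype]
        if w in self.vocab:  # known word: smoothed unigram estimate
            return math.log((self.word_counts[etype][w] + 1)
                            / (total + len(self.vocab)))
        # Unknown word: back off to P(capit(w) | type).
        return math.log((self.cap_counts[etype][capit(w)] + 1)
                        / (total + self.N_CAP_CLASSES))

    def log_p_length(self, length, etype):
        rate = self.n_entities[etype] / self.total_len[etype]  # 1 / mean length
        return math.log(rate) - rate * length

    def best_type(self, words):
        def score(etype):
            return (self.log_p_length(len(words), etype)
                    + sum(self.log_p_word(w, etype) for w in words))
        return max(sorted(self.n_entities), key=score)

TRAIN = [
    (["Real", "Madrid"], "ORG"), (["Banco", "Santander"], "ORG"),
    (["Madrid"], "LOC"), (["Barcelona"], "LOC"),
    (["Juan", "Carlos"], "PER"),
]
model = InternalModel(TRAIN)
```

In the full method, this internal score would be multiplied by the contextual probability and the types of all chunks in the sentence would be considered jointly, rather than one chunk at a time as here.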

Table 4 shows the results obtained by stacking the FB
algorithm on top of fnTBL. Comparing the results with the
ones in Table 2, one can observe that the global search
does improve the performance by 3 F-measure points when
compared with fnTBL+Snow and 5 points when compared with
the fnTBL system. Also presented in Table 4 is the
performance of the algorithm on perfect boundaries; more
than 6 F-measure points can be gained by improving the
boundary detection alone. Table 5 presents the detailed
performance of the FB algorithm on all four data sets,
broken down by entity type.

A quick analysis of the results revealed that most errors
were made on the unknown words, both in Spanish and Dutch:
the accuracy on known words is 97.4%/98.9%
(Spanish/Dutch), while the accuracy on unknown words is
83.4%/85.1%. This suggests that lists of entities have the
potential of being extremely beneficial for the algorithm.

⁴We use the notation $w_1^m = w_1 \ldots w_m$.

⁵It is notable here that the best entity type for a chunk
is computed by selecting the best entity in all
combinations of the other entity assignments in the
sentence. This choice is made because it better reflects
the scoring method, and makes the algorithm more similar to
the HMM's forward-backward algorithm (Jelinek, 1997,
chapter 13) than to the Viterbi algorithm.

4  Conclusion 

In conclusion, we have presented a classifier stacking
method which uses transformation-based learning to obtain
a coarse-grained initial entity annotation, then applies
Snow to improve the classification on samples where there
is strong feature interaction and, finally, uses a
forward-backward algorithm to compute a globally best
entity type assignment. By using pipelined processing,
this method improves the performance substantially when
compared with the original algorithms (fnTBL, Snow+fnTBL).